Abstract

We study the power of different types of adaptive (nonoblivious) adversaries in the setting of prediction with
expert advice, under both full-information and bandit feedback. We measure the player's performance using a new
notion of regret, also known as policy regret, which better captures the adversary's adaptiveness to the player's
behavior. In a setting where losses are allowed to drift, we characterize, in a nearly complete manner, the power
of adaptive adversaries with bounded memories and switching costs. In particular, we show that with switching costs,
the attainable rate with bandit feedback is $\widetilde{\Theta}(T^{2/3})$. Interestingly, this rate is significantly
worse than the $\Theta(\sqrt{T})$ rate attainable with switching costs in the full-information case. Via a novel
reduction from experts to bandits, we also show that a bounded memory adversary can force $\Omega(T^{2/3})$ regret
even in the full-information case, proving that switching costs are easier to control than bounded memory
adversaries. Our lower bounds rely on a new stochastic adversary strategy that generates loss processes with strong
dependencies.



1 Introduction 



An important instance of the framework of prediction with expert advice (see, e.g., [Cesa-Bianchi and Lugosi, 2006])
is defined as the following repeated game, between a randomized player with a fixed set of available actions, and
an adversary. At the beginning of each round of the game, the adversary assigns a loss to each action. Next,
the player defines a probability distribution over the actions, draws some action from this distribution, and suffers the
loss associated with that action. The game repeats for $T$ rounds and the player's goal is to accumulate the smallest
possible loss. In the literature, two basic versions of this game are considered: in the bandit feedback version, the
player observes the loss associated with his action on each round, but he does not observe the loss values associated
with the actions that he did not choose on that round; in the full-information feedback version of the game, the player
observes the adversary's entire assignment of loss values at the end of each round.

We assume that the adversary is adaptive (also called nonoblivious by Cesa-Bianchi and Lugosi [2006] or reactive
by Maillard and Munos [2010]), which means that the adversary chooses the loss values on round $t$ based on the
player's actions on rounds $1, \ldots, t-1$. We also assume that the adversary is deterministic and has unlimited
computational power. These assumptions imply that the adversary can specify his entire strategy before the game begins. In
other words, the adversary can perform all of the calculations needed to specify, in advance, how he plans to react to
any sequence of actions chosen by the player.

More formally, let $\mathcal{A}$ denote the set of actions, let $X_t$ denote the player's random action on round $t$,
and assume that the adversary defines, in advance, a sequence of history-dependent loss functions $f_1, \ldots, f_T$.
The input to each loss function $f_t$ is the entire history of the player's actions so far, therefore the player's loss
on round $t$ is $f_t(X_1, \ldots, X_t)$. Note that the player does not observe the functions $f_t$, only the losses
obtained conditioned on his past plays. Specifically, in the bandit feedback model, the player observes
$f_t(X_1, \ldots, X_t)$ on round $t$, whereas in the full-information model, the player observes
$f_t(X_1, \ldots, X_{t-1}, a)$ for all actions $a \in \mathcal{A}$.

We evaluate the player's performance using the notion of regret, which compares his cumulative loss to the
cumulative loss of the best fixed action in hindsight. Formally, the player's regret at the end of the game is defined
as

$$R_T = \sum_{t=1}^{T} f_t(X_1, \ldots, X_t) - \min_{x \in \mathcal{A}} \sum_{t=1}^{T} f_t(x, \ldots, x). \qquad (1)$$



Note that this is a random variable over the player's possibly randomized strategy, and we thus also consider the
expected regret $\mathbb{E}[R_T]$. This definition is the same as the one used by Merhav et al. [2002] and by
Arora et al. [2012] (where it is called policy regret), but differs from the following alternative definition of
expected regret

$$\mathbb{E}\Big[\sum_{t=1}^{T} f_t(X_1, \ldots, X_t)\Big] - \min_{x \in \mathcal{A}} \mathbb{E}\Big[\sum_{t=1}^{T} f_t(X_1, \ldots, X_{t-1}, x)\Big]. \qquad (2)$$

This second notion of regret is more common in the literature (e.g., [Auer et al., 2002, McMahan and Blum, 2004,
Dani and Hayes, 2006, Maillard and Munos, 2010]), but is clearly inadequate for measuring a player's performance
against an adaptive adversary. Indeed, with respect to such an adversary the quantities $f_t(X_1, \ldots, X_{t-1}, x)$
appearing in the second term of Eq. (2) are hardly interpretable; see [Arora et al., 2012] for a more detailed discussion.
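The difference between the two notions can be seen on a toy example. The sketch below evaluates, for a deterministic action sequence, the policy regret (the comparator plays one fixed action $x$ on every round, so the adversary reacts to $(x, \ldots, x)$) and the standard regret (only the last action of the player's own history is swapped for $x$). The adversary `switching_loss`, the function names, and the action sequence are illustrative assumptions and do not come from the paper.

```python
from typing import Callable, Sequence

# A loss function receives the history (x_1, ..., x_t) and returns f_t(x_1, ..., x_t).
Loss = Callable[[Sequence[int]], float]

def switching_loss(history: Sequence[int]) -> float:
    """Toy memory-1 adversary: zero base loss, unit cost whenever the action changes."""
    if len(history) >= 2 and history[-1] != history[-2]:
        return 1.0
    return 0.0

def player_loss(f: Loss, actions: Sequence[int]) -> float:
    """Cumulative loss sum_t f_t(X_1, ..., X_t) of the played sequence."""
    return sum(f(actions[:t]) for t in range(1, len(actions) + 1))

def policy_regret(f: Loss, actions: Sequence[int], action_set: Sequence[int]) -> float:
    """Comparator term: min_x sum_t f_t(x, ..., x)."""
    T = len(actions)
    best_fixed = min(sum(f([x] * t) for t in range(1, T + 1)) for x in action_set)
    return player_loss(f, actions) - best_fixed

def standard_regret(f: Loss, actions: Sequence[int], action_set: Sequence[int]) -> float:
    """Comparator term: min_x sum_t f_t(X_1, ..., X_{t-1}, x), for a deterministic player."""
    T = len(actions)
    best = min(
        sum(f(list(actions[: t - 1]) + [x]) for t in range(1, T + 1))
        for x in action_set
    )
    return player_loss(f, actions) - best
```

On the alternating sequence `[1, 2, 1, 2]` the player pays three switches. A fixed action pays nothing, so the policy regret is 3, while the standard comparator is charged for switches caused by the player's own history and the standard regret is only 2.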

In general, we seek algorithms for which $\mathbb{E}[R_T]$ grows sublinearly with $T$, implying that the per-round
expected regret, $\mathbb{E}[R_T]/T$, goes to zero with $T$. Unfortunately, Arora et al. [2012] also show that arbitrary
adaptive adversaries can easily force the regret to grow linearly. Thus, we need to focus on (reasonably) weaker
adversaries, which have constraints on the sequence of loss functions they can generate.

The weakest adversary we discuss is the oblivious adversary, which determines the loss on round $t$ based only on
the current action $X_t$. In other words, this adversary is oblivious to the player's past actions. Formally, the oblivious
adversary is constrained to choose a sequence of loss functions that satisfies

$$\forall t,\ \forall (x_1, \ldots, x_t) \in \mathcal{A}^t,\ \forall (x'_1, \ldots, x'_{t-1}) \in \mathcal{A}^{t-1}: \quad f_t(x_1, \ldots, x_t) = f_t(x'_1, \ldots, x'_{t-1}, x_t). \qquad (3)$$



The majority of previous work in online learning focuses on oblivious adversaries. When dealing with oblivious
adversaries, we denote the loss function by $\ell_t$ and omit the first $t-1$ arguments. With this notation, the loss at
time $t$ is simply written as $\ell_t(X_t)$.

For example, imagine an investor that invests in a single stock at a time. On each trading day he invests in one 
stock and suffers losses accordingly. In this example, the investor is the player and the stock market is the adversary. 
If the investment amount is small, the investor's actions will have no measurable effect on the market, so the market is 
oblivious to the investor's actions. Also note that this example relates to the full-information feedback version of the 
game, as the investor can see the performance of each stock at the end of each trading day. 

A stronger adversary is the oblivious adversary with switching cost. This adversary is similar to the oblivious
adversary defined above, but charges the player an additional switching cost of 1 whenever $X_t \neq X_{t-1}$. More
formally, this adversary defines his sequence of loss functions in two steps: first he chooses an oblivious sequence of
loss functions, $\ell_1, \ldots, \ell_T$, which satisfies the constraint in Eq. (3). Then, he sets $f_1(x) = \ell_1(x)$, and

$$\forall t \ge 2, \quad f_t(x_1, \ldots, x_t) = \ell_t(x_t) + \mathbb{1}_{\{x_t \neq x_{t-1}\}}.$$
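The two-step construction can be written as a small wrapper that turns an oblivious loss table into history-dependent loss functions. This is a minimal sketch: the function name, the table layout `ell[t-1][x]`, and the 0-indexed actions are assumptions made for the example.

```python
from typing import Callable, List, Sequence

def make_switching_cost_losses(ell: List[List[float]]) -> Callable[[Sequence[int]], float]:
    """Build f_t from an oblivious table, where ell[t-1][x] is the loss of action x on round t."""
    def f(history: Sequence[int]) -> float:
        t = len(history)
        loss = ell[t - 1][history[-1]]      # oblivious part: depends on x_t only
        if t >= 2 and history[-1] != history[-2]:
            loss += 1.0                      # unit switching cost when x_t != x_{t-1}
        return loss
    return f
```

The resulting function reads only the last two entries of the history, which is why this adversary has a memory of 1.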

This is a very natural setting. For example, let us consider again the single-stock investor, but now assume that 
each trade has a fixed commission cost. If the investor keeps his position in a stock for multiple trading days, he is 
exempt from any additional fees, but when he sells one stock and buys another, he incurs a fixed commission. More 
generally, this setting (or simple generalizations of it) allows us to capture any situation where choosing a different 
action involves a costly change of state. In the paper, we will also discuss a special case of this adversary, where the
loss function $\ell_t(x)$ for each action is sampled i.i.d. from a fixed distribution.

The switching cost adversary defines the loss on round $t$ as a function of $X_t$ and $X_{t-1}$, and is therefore a special
case of a more general adversary called an adaptive adversary with a memory of 1. This adversary is constrained to
choose loss functions that satisfy

$$\forall t,\ \forall (x_1, \ldots, x_t) \in \mathcal{A}^t,\ \forall (x'_1, \ldots, x'_{t-2}) \in \mathcal{A}^{t-2}: \quad f_t(x_1, \ldots, x_t) = f_t(x'_1, \ldots, x'_{t-2}, x_{t-1}, x_t).$$

This adversary is more general than the switching cost adversary because his loss functions can depend on the previous
action in an arbitrary way. We can further strengthen this adversary and define the bounded memory adaptive adversary,
which has a bounded memory of an arbitrary size. In other words, this adversary is allowed to set his loss function based
on the player's $m$ most recent past actions, where $m$ is a predefined parameter. Formally, the bounded memory adversary
must choose loss functions that satisfy

$$\forall t,\ \forall (x_1, \ldots, x_t) \in \mathcal{A}^t,\ \forall (x'_1, \ldots, x'_{t-m-1}) \in \mathcal{A}^{t-m-1}: \quad f_t(x_1, \ldots, x_t) = f_t(x'_1, \ldots, x'_{t-m-1}, x_{t-m}, \ldots, x_t).$$

In the information theory literature, this setting is called individual sequence prediction against loss functions with
memory [Merhav et al., 2002].



The bounded memory adaptive adversary has additional interesting special cases. One of them is the delayed
feedback oblivious adversary of Mesterharm [2005], which defines an oblivious loss sequence, but reveals each loss
value with a delay of $m$ rounds. Since the loss at time $t$ depends on the player's action at time $t-m$, this adversary
is a special case of a bounded memory adversary with a memory of size $m$. The delayed feedback adversary is not a
focus of our work, and we present it merely as an interesting special case of the bounded memory adversary.

So far, we have defined a succession of adversaries of different strengths. This paper's goal is to understand the 
upper and lower bounds on the player's regret in the face of these adversaries (focusing on the dependence on $T$), with
either full-information or bandit feedback. 



1.1 The Current State of the Art 



Different aspects of this problem have been previously studied and the known results are surveyed below. Most of 
these previous results rely on the additional assumption that the range of the loss functions is bounded in a fixed 
interval (say, the unit interval [0, 1]). We explicitly make note of this because our new results actually require a slightly 
weaker assumption. 

As mentioned above, the oblivious adversary has been studied extensively and is the best understood of all the
adversaries discussed in this paper. With full-information feedback, both the Hedge algorithm [Littlestone and Warmuth,
1994, Freund and Schapire, 1997] and the follow the perturbed leader (FPL) algorithm [Kalai and Vempala, 2005]
guarantee a regret of $O(\sqrt{T})$, with a matching lower bound of $\Omega(\sqrt{T})$; see, e.g., [Cesa-Bianchi and
Lugosi, 2006]. Analyses of Hedge in settings where the loss range may vary over time have also been considered; see, e.g.,
[Cesa-Bianchi et al., 2007].
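For reference, Hedge draws its action from an exponential-weights distribution over the cumulative losses. The sketch below shows only that sampling distribution in its textbook form; the function name and the learning rate `eta` are illustrative, and the tuning of `eta` that yields the $O(\sqrt{T})$ bound is not reproduced here.

```python
import math
from typing import List

def hedge_distribution(cumulative_losses: List[float], eta: float) -> List[float]:
    """Return p with p(i) proportional to exp(-eta * cumulative loss of action i)."""
    shift = min(cumulative_losses)  # subtracting a constant leaves p unchanged
    weights = [math.exp(-eta * (c - shift)) for c in cumulative_losses]
    total = sum(weights)
    return [w / total for w in weights]
```

Because the distribution depends only on differences between cumulative losses, adding the same offset to every action on a round does not change it.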

The oblivious setting with bandit feedback, where the player only observes the incurred loss $f_t(X_1, \ldots, X_t)$, is
called the nonstochastic (or adversarial) multi-armed bandit problem. In this harder setting, the Exp3 algorithm of
Auer et al. [2002] guarantees the same regret $O(\sqrt{T})$ as the full-information setting, and clearly the
full-information lower bound $\Omega(\sqrt{T})$ still applies.

The follow the lazy leader (FLL) algorithm of Kalai and Vempala [2005] is designed for the switching costs setting
with full-information feedback. The analysis of FLL guarantees that the oblivious component of the player's regret
(without counting the switching cost) is upper bounded by $O(\sqrt{T})$, and additionally, the number of switches is upper
bounded by $O(\sqrt{T})$. Together, these two bounds imply that the overall regret (including the switching costs) is upper
bounded by $O(\sqrt{T})$.

The algorithm of Merhav et al. [2002] focuses on the bounded memory adversary with full-information feedback,
referring to this problem as loss functions with memory. This algorithm guarantees a regret upper bound of $O(T^{2/3})$.
The work of Arora et al. [2012] extends this result to the bandit feedback case, maintaining the same upper bound of
$O(T^{2/3})$.

Learning with bandit feedback and switching costs has mostly been considered in the economics literature, using
a different setting than ours and with prior knowledge assumptions (see [Jun, 2004] for an overview). The setting
of stochastic oblivious adversaries (i.e., oblivious loss functions sampled i.i.d. from a fixed distribution) was first
studied by Agrawal et al. [1988], where they show that $o(\log T)$ switches are sufficient to guarantee logarithmic regret
asymptotically. Ortner [2010] achieves logarithmic regret nonasymptotically with $O(\log T)$ switches.

Table 1: State-of-the-art upper and lower bounds on regret against different adversary types. Our main new results are
presented in bold face.



Several other papers discuss online learning against "adaptive" adversaries [Auer et al., 2002, Dani and Hayes,
2006, Maillard and Munos, 2010, McMahan and Blum, 2004], but these results are not relevant to our work and can
be easily misunderstood. For example, even the Exp3 algorithm of Auer et al. [2002] has extensions to the "adaptive"
adversary case, with a regret upper bound of $O(\sqrt{T})$. This bound does not contradict the $\Omega(T)$ lower bound for
general adaptive adversaries mentioned earlier, since these papers use the standard regret Eq. (2) rather than the policy
regret Eq. (1). In this paper, our focus is on the latter, and therefore many of the previous results on learning against
"adaptive" adversaries are not relevant to our work.

Another related body of work lies in the field of competitive analysis (see [Borodin and El-Yaniv, 1998]), which
also deals with loss functions depending on the player's past actions, and the adversary's memory may even be
unbounded. However, obtaining sublinear regret is generally impossible in this case. Therefore, competitive analysis
studies much weaker performance metrics such as the competitive ratio, making it orthogonal to our work.



1.2 Our Contribution 

In this paper, we make the following contributions (see Table 1):

• Our main technical contribution is a new lower bound on regret that matches the existing upper bounds in several 
of the settings discussed above. Specifically, our lower bound applies to the switching cost adversary with bandit 
feedback (and therefore also to all strictly stronger adversaries with bandit feedback). 

• Building on this lower bound via a simple yet nontrivial argument, we prove a lower bound on regret in the 
bounded memory setting with full-information feedback. 

• Despite the lower bound, we show that for switching costs and bandit feedback, if we also assume stochastic
i.i.d. losses, then one can get a distribution-free regret bound of $O(\sqrt{T \log\log\log T})$ for finite action sets,
with just $O(\log\log T)$ switches. The result uses ideas from Cesa-Bianchi et al. [2013], and is shown in Appendix A.



Our new lower bound is a significant step towards a complete understanding of adaptive adversaries; observe that
the upper and lower bounds in Table 1 match in all but one of the settings.

Two consequences stand out from the rest. First, we see that the optimal regret against the switching cost adversary
is $\Theta(\sqrt{T})$ with full-information feedback, versus $\widetilde{\Theta}(T^{2/3})$ with bandit feedback. To the
best of our knowledge, this is the first theoretical confirmation that learning with bandit feedback is strictly harder
than learning with full-information, even on a small finite action set and even in terms of the dependence on $T$
(previous gaps we are aware of were either in terms of the number of actions [Auer et al., 2002], or required large or
continuous action spaces; see, e.g., [Bubeck et al., 2011, Shamir, 2012]). Moreover, recall the regret bound of
$O(\sqrt{T \log\log\log T})$ against the stochastic i.i.d. switching cost adversary in the bandit setting. This
demonstrates that dependencies in the loss process must play a crucial role in controlling the power of the switching
cost adversary. Indeed, the $\Omega(T^{2/3})$ lower bound proven in the next section heavily relies on such dependencies.

Second, we observe that in the full-information feedback case, the optimal regret against a switching cost adversary
is $\Theta(\sqrt{T})$, whereas the optimal regret against the more general bounded memory adversary is
$\widetilde{\Theta}(T^{2/3})$. This is somewhat surprising given the ideas presented by Merhav et al. [2002]: they take
an algorithm originally designed for oblivious adversaries, forcefully prevent it from switching actions very often, and
obtain a new algorithm that guarantees a regret of $O(T^{2/3})$ against bounded memory adversaries. This would seem to
imply that a small number of switches is the key to dealing with general bounded memory adversaries. Our result
contradicts this intuition by showing that controlling the number of switches is easier than dealing with a general
bounded memory adversary.

As noted above, our results require us to slightly weaken the standard technical assumption that loss values are 
bounded in [0, 1]. Instead, we make the following two assumptions: 

1. Bounded range. We assume that the loss values are contained in a bounded interval on each individual round,
but this interval can drift from round to round. Formally,

$$\forall t,\ \forall (x_1, \ldots, x_t, x) \in \mathcal{A}^{t+1}: \quad |f_t(x_1, \ldots, x_t) - f_t(x_1, \ldots, x_{t-1}, x)| \le 1.$$

2. Bounded drift. We also assume that the drift of each individual action from round to round is contained in a
bounded interval of size $D_t$, which can vary with time. Formally,

$$\forall t,\ \forall (x_1, \ldots, x_{t-1}, x) \in \mathcal{A}^{t}: \quad |f_t(x_1, \ldots, x_{t-1}, x) - f_{t+1}(x_1, \ldots, x_{t-1}, x, x)| \le D_t.$$
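For an oblivious loss sequence the two assumptions reduce to simple checks on a loss table: the spread across actions within a round, and the change of each action's loss between consecutive rounds. The following sketch computes both quantities; the function name and the table layout `ell[t][x]` are assumptions made for the example.

```python
from typing import List, Tuple

def range_and_drift(ell: List[List[float]]) -> Tuple[float, List[float]]:
    """Return (largest within-round spread, list of per-round drifts) for an oblivious loss table."""
    # Bounded range: on each round, losses of different actions differ by at most 1.
    max_range = max(max(row) - min(row) for row in ell)
    # Bounded drift: between rounds t and t+1, the loss of any fixed action moves by at most D_t.
    drifts = [
        max(abs(a - b) for a, b in zip(ell[t], ell[t + 1]))
        for t in range(len(ell) - 1)
    ]
    return max_range, drifts
```

In the test table below the losses leave the unit interval, yet the within-round spread stays below 1, which is exactly the situation the relaxed assumptions are meant to allow.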

Since these assumptions are a relaxation of the standard assumption, all of the existing lower bounds automatically
extend to our relaxed setting. For our result to be consistent with the current state of the art, we need to prove that
all of the upper bounds in previous works continue to hold after the relaxation.

2 Lower Bounds 

2.1 $\Omega(T^{2/3})$ Lower Bound for Bandits with Switching Costs

We first turn to prove our $\Omega(T^{2/3})$ regret lower bound, for the case of bandits with switching costs. It is enough
to consider a very simple setting, where we only have two actions, labeled 1 and 2. In keeping with the notation
previously introduced, we use $\ell_1, \ell_2, \ldots$ to denote the oblivious sequence of loss functions chosen by the
adversary before adding the switching cost. Our main result here is as follows:

Theorem 1. For any (possibly randomized) player strategy, there exists an oblivious loss sequence $\ell_1, \ell_2, \ldots$
over $\mathcal{A} = \{1, 2\}$, with bounded drift $D_t \le \sqrt{3\log(T) + 9}$ and bounded range (more specifically,
$\max_{t \le T} |\ell_t(1) - \ell_t(2)| \le T^{-1/3}$), such that the player's expected regret after any number $T$ of
plays is at least $c\,T^{2/3}$ for an absolute constant $c > 0$.

The proof idea is rather intuitive; we sketch it below before proceeding to the formal proof. 

We consider the following randomized adversarial strategy, where $\epsilon$ is a parameter (which will eventually be
chosen as $T^{-1/3}$) and $\xi_1, \xi_2, \ldots$ denotes an i.i.d. sequence of standard Gaussian random variables, with
zero mean and unit variance. Initially, the adversary picks a number $\epsilon > 0$ and flips its sign with probability
1/2. Then, the adversary samples an oblivious loss sequence $\{\ell_t(1), \ell_t(2)\}_{t=1}^{T}$ from the set of random
variables $\{L_t(1), L_t(2)\}_{t=1}^{T}$, defined as follows: for each $t = 1, 2, \ldots$

$$L_t(1) = \sum_{s=1}^{t} \xi_s, \qquad L_t(2) = \epsilon + L_t(1). \qquad (5)$$

In words, $\{L_t(1)\}$ is simply a Gaussian random walk, and $\{L_t(2)\}$ is the same random walk, shifted by
$\epsilon$; see Figure 1 for an illustration.
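The adversary's sampling procedure is short enough to simulate directly. The sketch below draws one realization of the two loss sequences; the function name, the use of Python's `random.Random`, and the return layout are choices made for illustration only.

```python
import random
from typing import List, Tuple

def sample_loss_sequence(T: int, eps: float, rng: random.Random) -> List[Tuple[float, float]]:
    """Sample (L_t(1), L_t(2)) for t = 1..T: a Gaussian random walk and its shifted copy."""
    if rng.random() < 0.5:
        eps = -eps                          # flip the sign of the gap with probability 1/2
    losses: List[Tuple[float, float]] = []
    walk = 0.0
    for _ in range(T):
        walk += rng.gauss(0.0, 1.0)         # xi_t: standard Gaussian increment
        losses.append((walk, walk + eps))   # action 2 is the same walk shifted by eps
    return losses
```

A player who never changes action sees only one coordinate of each pair, which has the same distribution whatever the sign of the gap.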

We show that the expected regret of any player (over coin flips of both the adversary and the player) is
$\Omega(T^{2/3})$. This is shown to imply that there exists a deterministic loss sequence (i.e., a deterministic
adversary), such that the regret of any algorithm is $\Omega(T^{2/3})$, and the losses satisfy the drift constraint
described above.

The high-level intuition of the proof is that the player can only gain information on which action is better by
switching between them: by staying on the same action, he will only observe a standard Gaussian walk, and get no
further information. By picking $\epsilon = T^{-1/3}$, we show that $\Omega(T^{2/3})$ switches will be needed to
distinguish the better action, but then the total regret incurred due to switching costs will be
$\Omega(\epsilon T) = \Omega(T^{2/3})$. Alternatively, the player may choose not to perform enough switches to
distinguish the better action, but then he incurs $\epsilon = T^{-1/3}$ expected regret per round, and the expected regret
will again be $\Omega(T^{2/3})$ overall.
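The choice $\epsilon = T^{-1/3}$ can be motivated by balancing the two costs in this argument. The heuristic below ignores constants and logarithmic factors, and it assumes that telling apart a shift of size $\epsilon$ in unit-variance Gaussian noise takes on the order of $1/\epsilon^2$ informative observations (here, switches); that scaling is a standard rule of thumb, used only as an illustration.

```latex
% cost of identifying the better action  ~ number of switches ~ 1/\epsilon^2
% cost of not identifying it             ~ \epsilon per round ~ \epsilon T
\[
\frac{1}{\epsilon^{2}} = \epsilon T
\quad\Longleftrightarrow\quad
\epsilon = T^{-1/3},
\qquad\text{in which case both costs equal } T^{2/3}.
\]
```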
Figure 1: A particular realization of the loss sequence as defined in Eq. (5). The sequence of losses for action 1
follows a Gaussian random walk, whereas the sequence of losses for action 2 follows the same random walk, but
slightly shifted by some parameter $\epsilon$ (which is randomly chosen to be either positive or negative).



Proof of Theorem 1.

As mentioned in the text, we consider a randomized adversary strategy as described in Eq. (5). Following standard
bandit lower bounds, we assume for now that the player's strategy is deterministic. Since any randomized strategy
can be seen as randomly picking a deterministic strategy, and since the randomization used by the adversary does not
depend on the player's strategy, it is enough to prove that for any deterministic player strategy, the expected regret
(over the adversary's random strategy) is $\Omega(T^{2/3})$. Also, note that for deterministic player strategies, we can
assume that the player's action $X_t$ on round $t$ is a deterministic function of the random losses
$L_1(X_1), \ldots, L_{t-1}(X_{t-1})$ already seen.

In the results below, we let $\mathbb{P}$ denote the joint distribution of the player's losses, and introduce the two
conditional distributions $\mathbb{S} = \mathbb{P}(\cdot \mid \epsilon > 0)$ (when 1 is the best action and 2 is the worst
action) and $\mathbb{Q} = \mathbb{P}(\cdot \mid \epsilon < 0)$ (when 2 is the best action and 1 is the worst action).
Hence, since the sign of $\epsilon$ is drawn at random, $\mathbb{P} = \frac{1}{2}(\mathbb{S} + \mathbb{Q})$. We use the
following technical lemma, whose proof appears in Appendix B.1.

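Lemma 1. Let $\mathbb{1}_{\{X_t \neq X_{t-1}\}}$ indicate whether the player switched actions on round $t$ (and 1 for
$t = 1$). Then for any event $A$,

$$|\mathbb{S}(A) - \mathbb{Q}(A)| \le \epsilon \sqrt{\mathbb{E}\Big[\sum_{t=1}^{T} \mathbb{1}_{\{X_t \neq X_{t-1}\}}\Big]},$$

where the expectation in the right-hand side is with respect to $\mathbb{P}$.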

With this lemma, we can prove a lower bound on the expected regret for randomized adversaries. 

Lemma 2. By picking e = T" 1 / 3 , the expected regret of any deterministic player strategy, over the randomness of the 
adversary, is at least j^T 2 / 3 . 

Proof. Let $A$ be the event that the worst action (action 2 if $\epsilon > 0$, and 1 if $\epsilon < 0$) was picked by the
player at least $T/2$ times. Also, let $S_T = \sum_{t=1}^{T} \mathbb{1}_{\{X_t \neq X_{t-1}\}}$ be the number of switches
the player performs. Then



E[R T ] > 



< &t, — 1 {A} 



Y 



> E 



= l -E[S T ] + ^P(A) 
Moreover, letting $A_1$ denote the event that the player chose action 1 at least $T/2$ times, and letting $A_2$ denote the
event that the player chose action 2 at least $T/2$ times, we have
$\mathbb{P}(A) = \frac{1}{2}\left(\mathbb{S}(A_2) + \mathbb{Q}(A_1)\right)$. Substituting this, we get

iE[5 r ] + y(S(A a ) + Q(Ai)). 
Using Lemma 1 to lower bound $\mathbb{Q}(A_1)$ via $\mathbb{S}(A_1)$, we get a lower bound of

^E[S T ] + y (S(A 2 ) + S(Ai) - eV^5rl) > ^E[5 T ] + ^ (s(4i U A 2 ) - e^E[S^j 



where we used a union bound and the fact that either $A_1$ or $A_2$ always holds. This is a quadratic function of
$\sqrt{\mathbb{E}[S_T]}$, and it is easily verified that the lowest possible value it can attain (for any value of
$\mathbb{E}[S_T]$) is

eT e 4 T 2 
Picking e = T- 1 / 3 , this equals (± - T 2 ' 3 > ±T 2 ' 3 . □

The lemma above lower bounds the expected regret of any deterministic player against the randomized adversary
strategy we have devised. This implies that there exists some deterministic adversarial strategy for which the
expected regret of any possibly randomized player satisfies the same lower bound. However, we are not done yet, since
this strategy does not guarantee that the losses have bounded drift: in our case, the variation is governed by a
potentially unbounded Gaussian random variable, so the deterministic adversary strategy that we picked might have an
arbitrarily large drift. So now, we show that there exists some deterministic adversarial strategy for which the expected
regret is large, and the variation is bounded. We utilize the following lemma, whose proof appears in the appendix.



Lemma 3. Let $Y$ be a random variable in $[-b, b]$ (where $b > 0$), with $\mathbb{E}[Y] \ge c$ for some $c \in [0, b/2]$. Then we have

^ Y ^ c ' 2 ^^-^Yb- 

We let $\mathbb{E}_{\text{player}}[R_T]$ denote the player's expected regret, conditioned on the adversary's randomization.
Lemma 2 gives a lower bound on $\mathbb{E}\big[\mathbb{E}_{\text{player}}[R_T]\big]$; also,
$|\mathbb{E}_{\text{player}}[R_T]|$ is at most $T + T^{2/3} \le 2T$ (since the regret is upper bounded by the maximal total
switching cost $T$, plus the maximal possible number of times we can pick the action which is worse by
$\epsilon = T^{-1/3}$, or $T^{2/3}$ overall). Using Lemma 3 we have

Now recall that $\xi_1, \ldots, \xi_T$ are the Gaussian random variables used by the randomized adversarial strategy. By a
standard Gaussian tail bound, we have for any $\delta \in (0, 1)$ that

$$\mathbb{P}\left(\exists t : |\xi_t| > \sqrt{2\log(T/\delta)}\right) \le \delta.$$

In particular, by picking $\delta = \frac{1}{80}T^{-1/3}$, this implies that

$$\mathbb{P}\left(\forall t : |\xi_t| \le \sqrt{2\log(80\,T^{4/3})}\right) \ge 1 - \frac{1}{80}T^{-1/3}. \qquad (7)$$

Combining Eq. (6) with Eq. (7), and observing that the two probability lower bounds sum to more than 1, we get that the
two events whose probabilities we have lower bounded must have a nonempty intersection. In other words, there exists
some adversarial strategy such that $\mathbb{E}_{\text{player}}[R_T]$ satisfies the lower bound claimed in Theorem 1, and
the drift $D_t$ is at most $\sqrt{2\log(80\,T^{4/3})} \le \sqrt{3\log(T) + 9}$ for all $t = 1, \ldots, T$.
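The last inequality on the drift can be verified by direct arithmetic. The derivation below assumes natural logarithms and $T \ge 1$; it is a verification sketch, not part of the original proof. It also uses the fact that under Eq. (5) both actions change by the same increment from round $t$ to round $t+1$, so the drift on that step equals $|\xi_{t+1}|$.

```latex
\begin{align*}
2\log\!\left(80\,T^{4/3}\right)
  &= 2\log 80 + \tfrac{8}{3}\log T \\
  &\le 9 + 3\log T ,
\end{align*}
% since 2 log 80 is roughly 8.77 (at most 9), and (8/3) log T is at most 3 log T when T >= 1.
```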






2.2 $\Omega(T^{2/3})$ Lower Bound for Full-Information with Bounded Memory

In this subsection, we build on the previous lower bound to prove an $\Omega(T^{2/3})$ regret lower bound even in the
full-information setting, where we get to see the entire loss vector on every round. To get this strong result, we need
to add just a little bit of power to the adversary: instead of having a memory of size 1 as in the case of switching
costs, we need to endow it with a memory of size 2. To show this result, we again consider a simple setting with two
actions. The formal guarantee is the following.

Theorem 2. Suppose $T > 1$. Then for any (possibly randomized) player strategy, there exists a fixed sequence of loss
functions $\{f_1, \ldots, f_T\}$, with a memory of size 2 and drift $D_t \le \sqrt{4\log(T) + 9}$ for all $t \le T$, such
that the player's expected regret is at least $0.015\, T^{2/3}$.

The theorem is based on a reduction from full-information to bandit feedback that might be of independent interest.
On each round $t$, the adversary assigns the same loss value to both actions. This value depends on the player's actions
on the previous rounds, but not on his current action $x_t$. Since the player doesn't know what the losses would have
been had he chosen different actions on the previous rounds, the full-information game becomes as hard as a bandit
feedback game. In our case, the loss value on round $t$ of the full-information game is set to equal the loss value on
round $t-1$ of a bandits-with-switching-cost game in which the player chooses the same sequence of actions. This can
be done with a memory of size 2, since this loss value is fully specified by the player's choices on rounds $t-1$ and
$t-2$. The $\Omega(T^{2/3})$ lower bound for the bandit game with switching costs extends to the full-information setting
with a bounded memory of size at least 2. A formal proof of the theorem appears in Appendix B.3.
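The reduction can be sketched as a wrapper around an oblivious loss table. The function name, the table layout `ell[t-1][x]`, and the convention that the first round carries zero loss are assumptions of this example; the formal construction in the appendix may handle the boundary rounds differently.

```python
from typing import Callable, List, Sequence

def make_memory2_losses(ell: List[List[float]]) -> Callable[[Sequence[int]], float]:
    """Full-information losses with memory 2 that replay a switching-cost bandit game.

    The loss on round t ignores the current action x_t and equals the round t-1 loss
    of the bandit game with switching costs played along the same action sequence.
    """
    def f(history: Sequence[int]) -> float:
        t = len(history)
        if t == 1:
            return 0.0                           # assumed convention: nothing to replay yet
        loss = ell[t - 2][history[-2]]           # ell_{t-1}(x_{t-1})
        if t >= 3 and history[-2] != history[-3]:
            loss += 1.0                          # switching cost between rounds t-2 and t-1
        return loss
    return f
```

Since the value is the same for every choice of the current action, revealing the full loss vector on round $t$ tells the player nothing beyond what bandit feedback would have revealed one round earlier.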



3 Upper Bounds 

In this section we prove that the known upper bounds on regret, which were originally proven for bounded losses, can 
be extended to the case of losses with bounded range and bounded drift. We do this via two simple reductions. 

1. Full information reduction: $f'_t(x_1, \ldots, x_t) = f_t(x_1, \ldots, x_t) - \min_{x \in \mathcal{A}} f_t(x_1, \ldots, x_{t-1}, x)$.

Obviously, this reduction can be computed in the full information setting as $f_t$ is revealed after each step $t$.
Moreover, because of the bounded range assumption, $f'_t \in [0, 1]$ for all $t$.

2. Bandit reduction: $f'_t(X_1, \ldots, X_t) = f_t(X_1, \ldots, X_t) - f_{t-1}(X_1, \ldots, X_{t-1}) + 1 + D$.

Here $D \ge D_t$ for all $t \ge 1$. Clearly, this reduction can be computed by any player in the bandit setting as
both $f_t(X_1, \ldots, X_t)$ and $f_{t-1}(X_1, \ldots, X_{t-1})$ are observed. Note also that $0 \le f'_t \le 2(1 + D)$,
since $|f_t(X_1, \ldots, X_t) - f_{t-1}(X_1, \ldots, X_{t-1})| \le 1 + D_{t-1}$ by the bounded range and bounded drift
assumptions.
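Both reductions use only quantities the player observes, which the following sketch makes explicit. The function names and argument layout are illustrative, and the caller is assumed to supply a suitable previous observation on the first round.

```python
from typing import Sequence

def full_info_reduction(loss_vector: Sequence[float], played: int) -> float:
    """Transformed loss f'_t: loss of the played action minus the smallest loss on this round."""
    return loss_vector[played] - min(loss_vector)

def bandit_reduction(observed_now: float, observed_prev: float, D: float) -> float:
    """Transformed loss f'_t = f_t(X_1..X_t) - f_{t-1}(X_1..X_{t-1}) + 1 + D."""
    return observed_now - observed_prev + 1.0 + D
```

For example, with the revealed vector `[10.25, 10.0, 10.75]` the full information reduction maps the three actions to `0.25`, `0.0` and `0.75`, which lie in the unit interval even though the raw losses do not.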

In the full information case, the extension to bounded drift is very easy. While the reduction preserves any regret
bound of the form $O(T^\alpha)$, it may affect, in certain cases, more sophisticated second-order regret bounds, such
as those proven by Cesa-Bianchi et al. [2007]. In the theorem statement below, we use "adversary type" to denote
any particular adversary we discussed in the paper: bounded memory, oblivious, switching costs, and stochastic i.i.d.
However, the result clearly generalizes to other types of adversaries, with other constraints on how they may choose
the loss functions.

Theorem 3. For any adversary type, let $A$ be a player strategy for the full information game with policy regret bounded
by $O(T^\alpha)$ against any adversarial strategy $f'_1, f'_2, \ldots$, where each $f'_t$ is a bounded loss function. Then,
there exists a player strategy $A'$ for the full information game with the same adversary type, whose policy regret is
also bounded by $O(T^\alpha)$ against any adversarial strategy $f_1, f_2, \ldots$ with bounded range and bounded drift.

Proof. The statement for the full information game is proven by defining $A'$ as the strategy that predicts like $A$ and,
after each prediction, feeds $A$ with the loss function transformed via the full information reduction. Note that the
reduction does not change the memory size of the adversary. Since the terms $\min_x f_t(x_1, \ldots, x_{t-1}, x)$ cancel
in the regret, and since the transformed loss is bounded, any regret bound for $A$ of the form $O(T^\alpha)$ also bounds
the regret of $A'$. □

Next we show the extension for the bandit setting. In this case, the reduction turns an oblivious adversary into an
adaptive adversary. Therefore, we need to use as a building block a player strategy that is guaranteed to work against
adaptive adversaries. Unlike the full information case, the reduction requires a known bound $D$ on the range and drift.

Theorem 4. Let $A$ be a player strategy for the bandit game with standard regret bounded by $O(T^\alpha)$ against any
adaptive adversarial strategy $f'_1, f'_2, \ldots$, where each $f'_t$ is a loss function with bounded range and no drift.
Then, for any $D > 0$, there exists a player strategy $A'$ for the bandit game whose regret is bounded by
$O(T^{1/(2-\alpha)})$ against any bounded memory adversarial strategy $f_1, f_2, \ldots$ with bounded range and drift
uniformly bounded by $D$.

Proof. The strategy $A'$ is the mini-batched version of $A$ in which each loss in a mini-batch step is transformed via the
bandit reduction. Let $Z_1, Z_2, \ldots$ be the actions played by $A$ in the mini-batch steps, let $g_1, g_2, \ldots$ be
the loss functions for the mini-batch steps, and let $g'_1, g'_2, \ldots$ be their transformed versions. It is then
immediate to verify that the standard regret of $A$ w.r.t. $g'_1, g'_2, \ldots$ is equal to the standard regret of $A$
w.r.t. the original sequence $g_1, g_2, \ldots$ of losses. The proof is concluded by applying [Arora et al., 2012,
Theorem 2] to relate standard regret to regret in the mini-batch setting. □



Using the Exp3 strategy of Auer et al. [2002] we can easily obtain the following result, whose proof appears in
Appendix B.4.

Corollary 1. In the bandit game, for any D > 0, 

1. there exists a player strategy whose regret is bounded by $O(\sqrt{T})$ against any oblivious adversarial strategy
with bounded range and drift uniformly bounded by D;

2. there exists a player strategy for the bandit game whose regret is bounded by $O(T^{2/3})$ against any bounded
memory adversarial strategy with bounded range and drift uniformly bounded by D.
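The exponent in part 2 is the exponent of Theorem 4 evaluated at the Exp3 rate. A minimal Python check, assuming Exp3 plays the role of A with exponent 1/2 (the function name `minibatch_regret_exponent` is ours, not the paper's):

```python
from fractions import Fraction


def minibatch_regret_exponent(alpha):
    """Exponent 1/(2 - alpha) from Theorem 4, for a base strategy with O(T^alpha) standard regret."""
    if not (0 <= alpha <= 1):
        raise ValueError("alpha must lie in [0, 1]")
    return 1 / (2 - alpha)


exp3_alpha = Fraction(1, 2)                      # Exp3 has O(sqrt(T)) standard regret
exponent = minibatch_regret_exponent(exp3_alpha)
print(exponent)                                  # 2/3
```

Exact rational arithmetic is used so that the result is the fraction 2/3 rather than a rounded float.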



4 Discussion 

In this paper, we studied the problem of prediction with expert advice against different types of adversaries, ranging 
from the oblivious adversary to the general adaptive adversary. We proved upper and lower bounds on the player's 
regret against each of these adversary types, in both the full information and the bandit feedback models. Our lower 
bounds matched our upper bounds in all but one case: the adaptive adversary with a unit memory in the full information 
setting, where we only know that regret is $\Omega(\sqrt{T})$ and $O(T^{2/3})$. Closing this gap is an important open problem.

Our new bounds have two important consequences. First, we characterize the regret attainable with switching 
costs, and show a setting where predicting with bandit feedback is strictly more difficult than predicting with full 
information feedback — even in terms of the dependence on T, and even on small finite action sets. Second, in the 
full information setting, we show that predicting against a switching cost adversary is strictly easier than predicting 
against an arbitrary adversary with a bounded memory. 

To obtain our results, we had to relax the standard assumption that loss values are bounded in [0, 1]. Re-introducing 
this assumption and proving similar lower bounds remains an elusive open problem. 

Our work also leaves many other questions unanswered. Can we characterize the dependence of the regret on
the size of the action set A? Can we strengthen any of our expected regret bounds to bounds that hold with high
probability? Can any of our results be generalized to more sophisticated notions of regret, such as shifting regret and
swap regret, as in [Arora et al., 2012]?

In addition to the adversary types discussed in this paper, there are other interesting classes of adversaries that 
lie between the oblivious and the adaptive. One of these is the oblivious adversary with delayed feedback, briefly
mentioned in the introduction. While some results for this adversary exist (see [Mesterharm, 2005]), the attainable
regret, especially in the bandit feedback case, is not clear. Another interesting case is the family of deterministically 
adaptive adversaries, which includes adversaries that adapt to the player's actions (so they are not oblivious) in a 
known deterministic way, rather than in a secret malicious way. For example, imagine playing a multi-armed bandit 
game where the loss values are initially oblivious, but whenever the player chooses an arm with zero loss, the loss of 
the same arm on the next round is deterministically changed to zero. In other words, whenever the player suffers a zero 
loss, he knows that choosing the same arm again guarantees another zero loss. This is an important setting because 
many real-world online prediction scenarios are indeed deterministically adaptive. 
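The zero-loss example above can be made concrete with a toy environment. The sketch below is only an illustration: the class name `StickyZeroBandit` and the input format `base_losses[t][arm]` are hypothetical choices, not part of the paper.

```python
class StickyZeroBandit:
    """Toy environment: an arm that just gave zero loss gives zero loss again on the next round."""

    def __init__(self, base_losses):
        # base_losses[t][arm] is the oblivious loss fixed in advance (hypothetical input format).
        self.base_losses = base_losses
        self.t = 0
        self.sticky_arm = None    # arm whose loss on the current round is forced to zero

    def pull(self, arm):
        if arm == self.sticky_arm:
            loss = 0.0            # deterministic adaptation, known to the player
        else:
            loss = self.base_losses[self.t][arm]
        # The arm stays "sticky" for the next round only if the loss just suffered was zero.
        self.sticky_arm = arm if loss == 0.0 else None
        self.t += 1
        return loss


env = StickyZeroBandit([[0.0, 1.0], [1.0, 1.0], [1.0, 0.5]])
observed = [env.pull(0), env.pull(0), env.pull(1)]
```

In this run the second pull of arm 0 returns zero even though the oblivious loss for that round was 1, which is exactly the adaptation the player can anticipate.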
A Distribution-free regret bound for bandits with switching costs

In this appendix we adapt results of [Cesa-Bianchi et al., 2013] to show a strategy that achieves $O(\sqrt{T \log\log\log T})$
regret against any i.i.d. oblivious adversary in the bandit setting with switching costs, assuming a finite action set
A = {1, . . . , K}. The strategy used by this stochastic adversary is specified by a probability distribution over oblivious
loss functions. The oblivious loss function for each step t = 1, 2, . . . is the realization of an independent draw $L_t$
from this distribution. The regret of a player choosing actions $X_0 = X_1, X_2, \ldots$ is defined by

$$R_T = \sum_{t=1}^{T} \mathbb{E}_t\bigl[ L_t(X_t) + \mathbb{I}\{X_t \neq X_{t-1}\} \bigr] - \min_{x} \sum_{t=1}^{T} \mathbb{E}\bigl[L_t(x)\bigr]$$

where the expectation $\mathbb{E}$ is over the random draw of each $L_t$ and the possible randomization of the player, and the
expectation $\mathbb{E}_t$ is conditioned on $X_1, L_1(X_1), \ldots, X_{t-1}, L_{t-1}(X_{t-1})$.

Our result focuses on loss distributions such that the law of each marginal $L_1(x)$ is subgaussian. A random
variable Z is subgaussian if there exist constants b, c such that for any a > 0, $\mathbb{P}(Z > \mathbb{E}Z + a) \le b e^{-ca^2}$ and
$\mathbb{P}(Z < \mathbb{E}Z - a) \le b e^{-ca^2}$. One can then show that, for any i.i.d. sequence $Z_1, \ldots, Z_T$ of subgaussian random
variables,



\ t=i 



In the following, we use the notation $\mathbb{E}[L_t(x)] = \mu(x)$ and $\mu^* = \min_{x \in A} \mu(x)$.

Theorem 5. Consider a finite action set A = {1, . . . , K}. Then for each T there exists a deterministic player
strategy for the bandit game with i.i.d. oblivious adversaries and switching costs, whose regret after T steps is
$O(\sqrt{T \log\log\log T})$ with high probability, provided the distribution of $L_1(x)$ is subgaussian for each $x \in A$.

Proof. Consider the following player that proceeds in stages. At each stage s = 1, 2, . . . , S, the player maintains a set
$A_s \subseteq A$ of active actions. Each action is played $T_s/|A_s|$ times in a round-robin fashion, where $T_s = T^{1-2^{-s}}$ is the
total number of plays in stage s and T is the known horizon. Note that the overall number of switches is at most KS,
where

$$S = \min\Bigl\{ j \in \mathbb{N} : \sum_{s=1}^{j} T_s \ge T \Bigr\} = O(\ln\ln T) .$$

Let $\hat\mu_s(x)$ be the sample mean of losses for action x in stage s, and define

$$\hat{x}_s = \operatorname*{argmin}_{x \in A_s} \hat\mu_s(x)$$

the best empirical action in stage s. The sets $A_s$ of active actions are defined as follows: $A_1 = A$ and

$$A_s = \bigl\{ x \in A_{s-1} : \hat\mu_{s-1}(x) \le \hat\mu_{s-1}(\hat{x}_{s-1}) + 2C_{s-1} \bigr\}$$

where

$$C_s = \sqrt{112\,(b/c)\,\frac{K}{T_s}\,\ln\frac{S}{\delta}} .$$



Note that $A_S \subseteq \cdots \subseteq A_1$ by construction. Also, using (8) and the union bound we have that

$$\max_{x \in A_s} \bigl| \hat\mu_s(x) - \mu(x) \bigr| \le C_s \qquad (9)$$

simultaneously for all s = 1, . . . , S with probability at least $1 - \delta$.
We claim the following. 
Claim 6. With probability at least $1 - \delta$,

$$x^* \in \bigcap_{s=1}^{S} A_s \quad\text{and}\quad 0 \le \hat\mu_s(x^*) - \hat\mu_s(\hat{x}_s) \le 2C_s \quad\text{for all } s = 1, \ldots, S.$$

Proof of Claim. We prove the claim by induction on s = 1, . . . , S. We first show that the base case s = 1 holds with
probability at least $1 - \delta/S$. Then we show that if the claim holds for s − 1, then it holds for s with probability at least
$1 - \delta/S$ over all random events in stage s. Therefore, using a union bound over s = 1, . . . , S we get that the claim
holds simultaneously for all s with probability at least $1 - \delta$.

For the base case s = 1 note that $x^* \in A_1$ by definition, and thus $\hat\mu_1(\hat{x}_1) \le \hat\mu_1(x^*)$ holds. Moreover, using (9) we
obtain that

$$\hat\mu_1(x^*) - \mu(x^*) \le C_1 \quad\text{and}\quad \mu(\hat{x}_1) - \hat\mu_1(\hat{x}_1) \le C_1$$

holds with probability at least $1 - \delta/S$. Since $\mu(x^*) - \mu(\hat{x}_1) \le 0$ by definition of $x^*$, we obtain

$$0 \le \hat\mu_1(x^*) - \hat\mu_1(\hat{x}_1) \le 2C_1$$

as required. We now prove the claim for s > 1 using the inductive assumption

$$x^* \in A_{s-1} \quad\text{and}\quad 0 \le \hat\mu_{s-1}(x^*) - \hat\mu_{s-1}(\hat{x}_{s-1}) \le 2C_{s-1} .$$

The inductive assumption directly implies that $x^* \in A_s$. Thus we have $\hat\mu_s(\hat{x}_s) \le \hat\mu_s(x^*)$, because $\hat{x}_s$ minimizes $\hat\mu_s$
over a set that contains $x^*$. The rest of the proof of the claim closely follows that of the base case s = 1. □

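Now, for any s = 1, . . . , S and for any $x \in A_s$ we have that

$$\begin{aligned}
\mu(x) - \mu(x^*) &\le \hat\mu_{s-1}(x) - \mu(x^*) + C_{s-1} && \text{by (9)} \\
&\le \hat\mu_{s-1}(\hat{x}_{s-1}) - \mu(x^*) + 3C_{s-1} && \text{by definition of } A_{s-1}, \text{ since } x \in A_s \subseteq A_{s-1} \\
&\le \hat\mu_{s-1}(x^*) - \mu(x^*) + 3C_{s-1} && \text{since } \hat{x}_{s-1} \text{ minimizes } \hat\mu_{s-1} \text{ in } A_{s-1} \\
&\le 4C_{s-1} && \text{by (9)}
\end{aligned}$$

holds with probability at least $1 - \delta/S$. Hence, recalling that

$$\sum_{t=1}^{T} \mathbb{I}\{X_t \neq X_{t-1}\} \le KS$$

holds deterministically, the regret of the player over the T plays can be bounded as follows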

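$$\begin{aligned}
KS + \sum_{t=1}^{T} \bigl(\mu(X_t) - \mu^*\bigr) &= KS + \sum_{s=1}^{S} \frac{T_s}{|A_s|} \sum_{x \in A_s} \bigl(\mu(x) - \mu^*\bigr) \\
&= KS + \frac{T_1}{K} \sum_{x=1}^{K} \bigl(\mu(x) - \mu^*\bigr) + \sum_{s=2}^{S} \frac{T_s}{|A_s|} \sum_{x \in A_s} \bigl(\mu(x) - \mu^*\bigr)
\end{aligned}$$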



< #S + T 1( tt* + £ 4T S W 112(6/c) J In ^ 

i=2 



+ Ti/Z + A\lll2(b/c)K\n ^ 



Now, since $T_1 = \sqrt{T}$, $T_s/\sqrt{T_{s-1}} = \sqrt{T}$ and $S = O(\ln\ln T)$, we obtain that with probability at least $1 - \delta$ the regret
is at most of order

K In InT + // VT + J KT (in 4 + In In In T 



as desired. □ 
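The staged strategy in this proof can be sketched in Python. The sketch rests on several assumptions that are ours rather than the paper's: each active action is played in one contiguous block per stage (so that switches happen only between blocks), the confidence radius is supplied by the caller through the hypothetical hook `radius(s, n)` instead of the constant $C_s$, and the last stage is not truncated at the horizon T.

```python
import math


def stage_lengths(T):
    """Stage sizes T_s = T^(1 - 2^-s), generated until they cover the horizon T."""
    lengths, total, s = [], 0.0, 1
    while total < T:
        T_s = T ** (1 - 2.0 ** (-s))
        lengths.append(T_s)
        total += T_s
        s += 1
    return lengths


def staged_elimination(sample_loss, K, T, radius):
    """Play each active arm in one block per stage, then prune clearly suboptimal arms.

    sample_loss(arm) draws one loss for the given arm.
    radius(s, n) is a caller-supplied confidence radius for stage s after n plays per arm.
    Returns (final active set, number of switches).
    """
    active = list(range(K))
    switches, last_arm = 0, None
    for s, T_s in enumerate(stage_lengths(T), start=1):
        n = max(1, int(T_s) // len(active))      # plays per active arm in this stage
        means = {}
        for arm in active:
            if last_arm is not None and arm != last_arm:
                switches += 1
            last_arm = arm
            means[arm] = sum(sample_loss(arm) for _ in range(n)) / n
        best = min(means.values())
        c = radius(s, n)
        # Keep only arms whose empirical mean is within 2c of the best empirical mean.
        active = [a for a in active if means[a] <= best + 2 * c]
    return active, switches


# Noise-free toy run: three arms with fixed losses, radius 1/sqrt(n) chosen for illustration.
true_means = [0.1, 0.5, 0.9]
T = 10 ** 4
lengths = stage_lengths(T)
active, switches = staged_elimination(
    lambda arm: true_means[arm], K=3, T=T, radius=lambda s, n: 1 / math.sqrt(n)
)
```

With these inputs the schedule has five stages, the two worse arms are dropped after the first stage, and the player switches three times in total, well below the bound KS on the number of switches.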
B Proofs



B.1 Proof of Lemma 1

To show this, we use the chain rule for relative entropy, which implies

$$D_{\mathrm{KL}}(\mathbb{S} \,\|\, \mathbb{Q}) = \sum_{t=1}^{T} D_{\mathrm{KL}}(\mathbb{S}_{t-1} \,\|\, \mathbb{Q}_{t-1}) \qquad (10)$$

where $\mathbb{S}_{t-1}$ and $\mathbb{Q}_{t-1}$ denote the distributions of the player's loss $L_t(x_t)$ conditioned on $L_1, \ldots, L_{t-1}$, when the joint
distribution of $L_1, \ldots, L_T$ is, respectively, $\mathbb{S}$ and $\mathbb{Q}$.

Let us focus on a particular term $D_{\mathrm{KL}}(\mathbb{S}_{t-1} \,\|\, \mathbb{Q}_{t-1})$ and a particular realization of the random losses $L_1, \ldots, L_{t-1}$.
Since we assume a deterministic player strategy, for any such realization the player's choices $x_1, \ldots, x_t$ are all
determined, and we deterministically have that the player either switched or not at time t. If she did not switch, then $L_t(x_t)$
is distributed as $L_{t-1}(x_{t-1}) + \xi_t$ under both measures $\mathbb{S}_{t-1}$ and $\mathbb{Q}_{t-1}$, so the relative entropy between them is zero. If
the algorithm did switch, then $L_t(x_t)$ is distributed as $L_{t-1}(x_{t-1}) - |\epsilon| + \xi_t$ under $\mathbb{S}_{t-1}$ (where the switch is towards
the best action), and as $L_{t-1}(x_{t-1}) + |\epsilon| + \xi_t$ under $\mathbb{Q}_{t-1}$ (where the switch is towards the worst action). Hence, the
relative entropy is the same as that between two standard Gaussians whose means are shifted by $2|\epsilon|$, namely $2\epsilon^2$. So overall, we
can upper bound Eq. (10) by



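$$2\epsilon^2\, \mathbb{E}\Bigl[ \sum_{t=1}^{T} \mathbb{I}\{x_t \neq x_{t-1}\} \,\Big|\, \epsilon > 0 \Bigr] . \qquad (11)$$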



Using a similar argument, we also show that $D_{\mathrm{KL}}(\mathbb{Q} \,\|\, \mathbb{S})$ is upper bounded by Eq. (11) in which the conditioning on
$\epsilon > 0$ is replaced by $\epsilon < 0$. Then, Pinsker's inequality implies that $|\mathbb{S}(A) - \mathbb{Q}(A)|$ is at most



t=i 



e > 



e < 



r 2 E 



t=l 



which gives the desired bound. 
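For reference, the value $2\epsilon^2$ used in this proof is the standard closed form for the relative entropy between two unit-variance Gaussians, and Pinsker's inequality is the standard bound on the difference of probabilities. Both are restated below in generic notation ($\mu_1$, $\mu_2$ are our placeholder names for the two means):

```latex
D_{\mathrm{KL}}\bigl(\mathcal{N}(\mu_1, 1) \,\|\, \mathcal{N}(\mu_2, 1)\bigr) = \frac{(\mu_1 - \mu_2)^2}{2},
\qquad
|\mu_1 - \mu_2| = 2|\epsilon|
\;\Longrightarrow\;
D_{\mathrm{KL}} = \frac{(2|\epsilon|)^2}{2} = 2\epsilon^2 ,
\qquad
\bigl|\mathbb{S}(A) - \mathbb{Q}(A)\bigr| \le \sqrt{\tfrac{1}{2}\, D_{\mathrm{KL}}(\mathbb{S} \,\|\, \mathbb{Q})} .
```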



B.2 Proof of Lemma 3



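$$\begin{aligned}
c \le \mathbb{E}[Y] &= \mathbb{P}(Y > c/2)\,\mathbb{E}[Y \mid Y > c/2] + \mathbb{P}(Y \le c/2)\,\mathbb{E}[Y \mid Y \le c/2] \\
&\le \mathbb{P}(Y > c/2)\, b + \bigl(1 - \mathbb{P}(Y > c/2)\bigr)\, c/2 .
\end{aligned}$$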

Solving for P(Y > c/2) gives the desired result. 
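The final algebraic step can be spelled out as follows. Here $p$ is our shorthand for $\mathbb{P}(Y > c/2)$, $b$ is the upper bound on $Y$ that multiplies this probability in the display above, and we assume $b > c/2$ so that the division is valid:

```latex
c \le p\,b + (1 - p)\,\frac{c}{2}
\;\Longrightarrow\;
\frac{c}{2} \le p\Bigl(b - \frac{c}{2}\Bigr)
\;\Longrightarrow\;
p \ge \frac{c/2}{\,b - c/2\,} = \frac{c}{2b - c} .
```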

B.3 Proof of Thm. 2

Thm. 1 guarantees that given any player's strategy, there is some deterministic adversary strategy with a lower bound
on the regret. However, as part of proving Thm. 1 we actually showed that there exists some randomized adversary
strategy $\{f_t\}_{t=1}^{T}$ with memory size 1, such that for any (possibly randomized) player strategy $X_1, \ldots, X_T$,

$$\mathbb{E}\Bigl[ \sum_{t=1}^{T} f_t(X_{t-1}, X_t) - \min_{x \in A} \sum_{t=1}^{T} f_t(x, x) \Bigr] \ge \frac{1}{10}\, T^{2/3} \qquad (12)$$



(see Lemma 2). We now use this strategy to define a randomized adversary strategy for our setting (with memory size
2), for a game of T + 1 rounds. We let $f_1(x_1) = 0$ for any $x_1$, $f_2(x_1, x_2) = f_1(x_1)$, and for every t = 3, . . . , T + 1,

$$f_t(x_{t-2}, x_{t-1}, x_t) = f_{t-1}(x_{t-2}, x_{t-1}) . \qquad (13)$$
Now, suppose we had some (possibly randomized) player strategy $X_1, \ldots, X_{T+1}$, so that in expectation over the
player and adversary strategies, we have

$$\mathbb{E}\Bigl[ \sum_{t=1}^{T+1} f_t(X_{t-2}, X_{t-1}, X_t) - \min_{x \in A} \sum_{t=1}^{T+1} f_t(x, x, x) \Bigr] < \frac{1}{10}\, T^{2/3} .$$

In particular, since $f_1$ is always 0, it would imply that

$$\mathbb{E}\Bigl[ \sum_{t=2}^{T+1} f_t(X_{t-2}, X_{t-1}, X_t) - \min_{x \in A} \sum_{t=2}^{T+1} f_t(x, x, x) \Bigr] < \frac{1}{10}\, T^{2/3} .$$

By Eq. (13), this implies

$$\mathbb{E}\Bigl[ \sum_{t=1}^{T} f_t(X_{t-1}, X_t) - \min_{x \in A} \sum_{t=1}^{T} f_t(x, x) \Bigr] < \frac{1}{10}\, T^{2/3} .$$

Thus, if we could implement the player strategy $X_1, \ldots, X_T$ in the bandits-with-switching-cost setting, it would
contradict Eq. (12). To see that this indeed can happen, note that each $X_t$ is a (possibly randomized) function of
$X_1, \ldots, X_{t-1}$ as well as $\{f_\tau(X_{\tau-2}, X_{\tau-1}, X_\tau)\}_{\tau=1}^{t-1}$. But again, due to Eq. (13) and the fact that $f_1$ is always 0,
$X_t$ can in fact be defined using $X_1, \ldots, X_{t-1}$ and

$$\{f_\tau(X_{\tau-2}, X_{\tau-1}, X_\tau)\}_{\tau=2}^{t-1} = \{f_{\tau-1}(X_{\tau-2}, X_{\tau-1})\}_{\tau=2}^{t-1} .$$

The right hand side is an observable quantity in the bandit setting: in each round t, we know the set of
losses $\{f_{\tau-1}(X_{\tau-2}, X_{\tau-1})\}_{\tau=2}^{t-1}$ that we obtained. Thus, we can simulate the strategy $X_1, \ldots, X_T$ in the
bandit-with-switching-cost setting, and get an expected regret smaller than $\frac{1}{10} T^{2/3}$, contradicting Eq. (12). Thus, the expected
regret (for a game of T + 1 rounds) must be at least $\frac{1}{10} T^{2/3}$. Substituting T instead of T + 1, we get that the expected
regret for a game with T rounds is at least $\frac{1}{10}(T - 1)^{2/3} \ge 0.03\, T^{2/3}$.

The regret bound we just now obtained is in expectation over the randomized adversary strategy, and holds for any
player's strategy. We now use the same line of argument as in the last part of Thm. 1's proof, to show that for any
(possibly randomized) player's strategy, there exists some deterministic adversary strategy, with a similar expected
regret bound, and with losses of bounded drift. Specifically, Lemma 3 implies that if the random variable Y is the
expected regret $\mathbb{E}_{\text{player}}[R_T]$ over the player's actions, then

$$\mathbb{P}\bigl( \mathbb{E}_{\text{player}}[R_T] > 0.015\, T^{2/3} \bigr) \ge 0.0075\, T^{-1/3} .$$

Moreover, analogous to Eq. ©, the probability of the loss drift being at most $\sqrt{2\log(T^{4/3}/0.007)} \le \sqrt{4\log(T) + 9}$
is at least $1 - 0.007\, T^{-1/3}$. Thus, the intersection of the two events is not empty, and this implies that there exists
some deterministic adversary strategy with expected regret $> 0.015\, T^{2/3}$ and loss drift at most $\sqrt{4\log(T) + 9}$.
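The numerical constants in this argument can be sanity-checked with a few lines of Python. This is a spot check at selected horizons, not a proof; the function name `check_constants` is ours, and the check assumes natural logarithms and T at least 2.

```python
import math


def check_constants(T):
    """Return three booleans, one per inequality used in the argument, at horizon T."""
    # (1/10)(T - 1)^(2/3) >= 0.03 T^(2/3)
    regret_ok = 0.1 * (T - 1) ** (2 / 3) >= 0.03 * T ** (2 / 3)
    # sqrt(2 log(T^(4/3) / 0.007)) <= sqrt(4 log(T) + 9)
    drift_ok = math.sqrt(2 * math.log(T ** (4 / 3) / 0.007)) <= math.sqrt(4 * math.log(T) + 9)
    # P(large regret) >= 0.0075 T^(-1/3) exceeds P(large drift) <= 0.007 T^(-1/3),
    # so the two events must intersect.
    overlap_ok = 0.0075 * T ** (-1 / 3) > 0.007 * T ** (-1 / 3)
    return regret_ok, drift_ok, overlap_ok


results = {T: check_constants(T) for T in (2, 10, 10 ** 3, 10 ** 6)}
```

The drift inequality is nearly tight at T = 2 and fails at T = 1, so the argument should be read for horizons of at least two rounds.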



B.4 Proof of Corollary 1

Part 1 follows by running Exp3 with the bandit reduction, which transforms an oblivious loss sequence $\ell_1, \ell_2, \ldots$ into a
nonoblivious loss sequence $f'_1, f'_2, \ldots$. For any action $x \in A$ we have

$$\begin{aligned}
\mathbb{E} \sum_t \bigl(\ell_t(X_t) - \ell_t(x)\bigr) &= \mathbb{E} \sum_t \bigl(\ell_t(X_t) - \ell_{t-1}(X_{t-1}) + 1 + D - \ell_t(x) + \ell_{t-1}(X_{t-1}) - 1 - D\bigr) \\
&= \mathbb{E} \sum_t \bigl(f'_t(X_{t-1}, X_t) - f'_t(X_{t-1}, x)\bigr) .
\end{aligned}$$

The term in the right-hand side is the standard regret against a nonoblivious adversary, which using Exp3 is bounded
by $O(\sqrt{T})$.

Part 2 follows again by using Exp3 for the strategy A and applying the abovementioned standard regret bound 
against nonoblivious adversaries. 
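The identity used in Part 1 holds term by term and can be checked numerically. The sketch below reads the transformed loss off the displayed derivation as the current loss minus the previous loss plus 1 + D; the helper names and the conventions for the first round (a zero loss vector before round 1 and a previous action equal to the first action) are our assumptions.

```python
import math
import random


def reduced_loss(loss_t, loss_prev, x_prev, x, D):
    """Transformed loss: l_t(x) - l_{t-1}(x_prev) + 1 + D, as read off the displayed identity."""
    return loss_t[x] - loss_prev[x_prev] + 1 + D


def regret_pair(losses, actions, x, D):
    """Return (regret on the original losses, regret on the transformed losses) against action x."""
    original = reduced = 0.0
    # Assumed conventions for t = 1: zero loss vector before the game, previous action = first action.
    loss_prev, x_prev = [0.0] * len(losses[0]), actions[0]
    for loss_t, x_t in zip(losses, actions):
        original += loss_t[x_t] - loss_t[x]
        reduced += (reduced_loss(loss_t, loss_prev, x_prev, x_t, D)
                    - reduced_loss(loss_t, loss_prev, x_prev, x, D))
        loss_prev, x_prev = loss_t, x_t
    return original, reduced


rng = random.Random(0)
losses = [[rng.random() for _ in range(3)] for _ in range(50)]
actions = [rng.randrange(3) for _ in range(50)]
orig_regret, red_regret = regret_pair(losses, actions, x=1, D=0.5)
```

The two regrets agree up to floating-point error because the previous-round term and the constant 1 + D appear in both transformed losses and cancel in the difference.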